Abstract


A popular approach to high-dimensional control problems in robotics
uses a library of candidate “maneuvers” or “trajectories” [13, 28]. The
library is either evaluated on a fixed number of candidate choices at runtime
(e.g., path set selection for planning) or by iterating through a sequence of
feasible choices until success is achieved (e.g., grasp selection). The
performance of the library relies heavily on the content and order of the
sequence of candidates. We propose a provably efficient method to optimize such
libraries, leveraging recent advances in optimizing submodular functions of
sequences [29]. This approach is demonstrated on two important problems:
mobile robot navigation and manipulator grasp set selection. In the first case,
performance can be improved by choosing a subset of candidates which
optimizes the metric under consideration (cost of traversal). In the second
case, performance can be optimized by minimizing the depth to which the list is
searched before a successful candidate is found. Our method can be used in both
online and batch settings with provable performance guarantees, and can be
run in an anytime manner to handle real-time constraints.


1  Introduction 

Many approaches to high-dimensional robotics control problems such as
grasp selection for manipulation [4, 15] and trajectory set generation for
autonomous mobile robot navigation [16] use a library of candidate
“maneuvers” or “trajectories”. Such libraries effectively discretize a large
control space and enable tasks to be completed with reasonable performance
while still respecting computational constraints. The library is used by
evaluating a fixed number of candidate maneuvers at runtime or iterating through
a sequence of choices until success is achieved. Performance of the library
depends heavily on the content and order of the sequence of candidates.

This class of problems can be framed as list optimization problems where
the ordering of the list heavily influences both performance and computation
time [23]. In such settings queries arrive sequentially and decisions have to
be taken at each time step either by searching a list until success is achieved
(grasp selection, novelty detection) or by generating a subset of the list to be
evaluated (trajectory set generation). For problems such as grasp selection,
where the system is searching for the first successful grasp in a list of
candidate grasps, performance depends on the depth in the list that has to be
searched before a successful answer can be found. For problems where the
system must generate a subset of a bigger list to be evaluated, performance
depends on the subset maximizing a metric. For the case of trajectory
set generation this corresponds to coming up with a set of trajectories such
that the computed cost of traversal of the robot is minimized.

As we show below, maneuver library optimization problems exhibit the
property of monotone sequence submodularity [14, 29]: the value of adding
each additional element to the list yields diminishing returns over earlier
additions. For example, in the case of grasp selection, adding a candidate grasp
to a pre-existing large list of grasps does not increase the chance of
selecting a successful grasp as much as adding the candidate grasp to a smaller
subsequence of that list would.

In this paper we take advantage of recent advances in submodular
sequence function optimization [29] to propose an approach to high-dimensional
robotics control problems that leverages the online and submodular nature
of list optimization. The results of Streeter et al. [29] establish algorithms
that are near-optimal (within known NP-hard approximation bounds) in both
a fixed design and no-regret sense. Such results may be somewhat
unsatisfactory for the control problem we address, as we are concerned about
performance on future data. We therefore consider two batch settings: static
optimality, where we consider a distribution over training examples that are
independently and identically distributed (i.i.d.) (grasp selection), and a form
of dynamic optimality, where the distribution of examples is influenced by
the execution of the control libraries. We show that online-to-batch
conversions [6] combined with the advances in online submodular function
maximization enable us to effectively optimize these control libraries.

For the trajectory sequence selection problem, we show that our
approach exceeds the performance of the current state of the art by
achieving lower cost of traversal in a real-world path planning scenario [16]. For
grasp selection (related to the MIN-SUM SUBMODULAR COVER problem)
we show that we can adaptively reorganize a list of grasps such that the
depth traversed in the list until a successful grasp is found is minimized.
Our approach outperforms approaches such as random grasp orderings or
orderings that rank grasps by average rate of success. We emphasize that
although our approach in both cases is online in nature, it can operate in
an offline mode where the system is trained using previously collected data to
learn to optimize a list of candidate grasps or trajectories. During runtime
the learned static list (or distribution over lists) can then be utilized for all
queries without incorporating additional performance feedback.

In Section 2 we briefly review the use of control libraries for various
robotics problems. In Section 3 we review the concept of sequence
submodularity and outline the approach for online monotone sequence submodular
function optimization we use in the two application domains. Sections 4 and
5 explain the versions of the approach applied to the two domains, together
with the experimental setup and performance results compared with the state
of the art in path planning and grasp selection. We conclude in Section 6
with future work in this area.


2  Control  Libraries 

Control libraries approximate a large (possibly infinite) set of feasible
controllers by sampling and storing a (relatively) small number of controllers.
At runtime, the goal becomes to find the “best” element of the library to
execute. If the library was chosen well, an approximation to
the optimal control can be found in the library while maintaining a limit on
computation.

Stolle et al. [28] have used trajectory libraries from expert demonstration
to find suitable control policies in high-dimensional spaces for a walking
robot in an anytime manner. Frazzoli et al. [13] recorded expert human
pilots aggressively flying small unmanned vehicles and created a library of
trajectory primitives. Given a planning task, a feasible trajectory could then
be quickly generated using a concatenation of these stored trajectories. The
advantage of such a library was that each of the stored trajectory primitives is
guaranteed to be dynamically feasible and hence a new trajectory generated
by a concatenation of such primitives is also dynamically feasible, subject
to certain transition constraints.

Grasp sequence ranking is usually accomplished by evaluating the force
closure and stability criterion for all grasps within a library, then executing
the one with the highest score [4, 26, 7, 8]. Goldfeder et al. [15] store a library
of precomputed grasps for a wide variety of objects. Given a novel object,
they find the closest object in the library and use the grasps associated with
that object to suggest a grasp for the new scenario.

In path planning for mobile robot navigation, one of the most
powerful methods leverages a “local planner” [11] that evaluates a sequence of
trajectories [16] to identify the best trajectory amongst these and then
advances a portion of this trajectory. This is executed in a receding-horizon
fashion. Various methods have been proposed for generating suitable
sequences of trajectories offline [16, 5, 10], but because in many cases this
entire set of stored trajectories may be evaluated for each situation, the size
of this library is strictly limited by available online computation.

A fundamental question remaining is how such control libraries should
be constructed and organized in order to maximize the performance on the
task at hand while minimizing search time. We provide a data-driven
approach to the problem of constructing and optimizing control libraries.


3  Review  of  Submodularity  and  Maximization  of  Submodular  Functions

A function $f : \mathcal{S} \to [0, 1]$, where $\mathcal{S}$ is the set of all
sequences, is monotone submodular if it satisfies the following two properties:

•  (Monotonicity) for any sequences $S_1, S_2 \in \mathcal{S}$,
$f(S_1) \le f(S_1 \oplus S_2)$ and $f(S_2) \le f(S_1 \oplus S_2)$

•  (Submodularity) for any sequences $S_1, S_2 \in \mathcal{S}$ and any action
$a \in \mathcal{V} \times \mathbb{R}_{>0}$,
$f(S_1 \oplus S_2 \oplus \langle a \rangle) - f(S_1 \oplus S_2) \le f(S_1 \oplus \langle a \rangle) - f(S_1)$

where $\oplus$ means order-dependent concatenation of lists.
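The two properties can be checked by brute force on a small example. The sketch below is only illustrative: the coverage-style function, the toy action set `COVERS`, and all helper names are assumptions introduced here, not part of the paper, and Python list concatenation stands in for the order-dependent concatenation of sequences.

```python
from itertools import permutations

# Hypothetical toy data: three actions covering parts of a 4-element universe.
COVERS = {"a": {0, 1}, "b": {1, 2}, "c": {3}}


def f(sequence):
    """Fraction of the 4-element universe covered by the actions in `sequence`."""
    covered = set()
    for action in sequence:
        covered |= COVERS[action]
    return len(covered) / 4


def marginal_gain(func, prefix, action):
    """func(prefix + <action>) - func(prefix), with list concatenation."""
    return func(prefix + [action]) - func(prefix)


def check_properties(func, actions):
    """Brute-force check of monotonicity and diminishing returns on short sequences."""
    seqs = [[]] + [list(p) for r in (1, 2) for p in permutations(actions, r)]
    for s1 in seqs:
        for s2 in seqs:
            joined = s1 + s2
            # Monotonicity: concatenation never lowers the value of either part.
            if func(s1) > func(joined) or func(s2) > func(joined):
                return False
            # Submodularity: an action helps a longer sequence no more than a shorter one.
            for a in actions:
                if marginal_gain(func, joined, a) > marginal_gain(func, s1, a):
                    return False
    return True
```

On this toy function, adding action "b" to the empty sequence gains 0.5, while adding it after "a" gains only 0.25, which is the diminishing-returns behaviour the definition describes.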

In the online setting, α-regret is defined as the difference between the
performance of an algorithm and α times the performance of the best expert in
retrospect. Streeter et al. [29] provide algorithms for maximization of
submodular functions whose α-regret (regret with respect to proven NP-hard
bounds) approaches zero as a function of time.
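Written out, one common way to express this quantity is shown below. The sign convention (scaled best-in-hindsight value minus the algorithm's value) and the symbols $f_t$ for the objective on task $t$ and $S_t$ for the sequence played on task $t$ are assumptions made for illustration; the text above fixes only the two quantities being compared.

```latex
R_{\alpha}(T) \;=\; \alpha \cdot \max_{S \in \mathcal{S}} \sum_{t=1}^{T} f_t(S) \;-\; \sum_{t=1}^{T} f_t(S_t)
```

Under this reading, a "zero α-regret" algorithm is one for which the time-averaged quantity $R_{\alpha}(T)/T$ tends to zero as $T$ grows.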

We review here the relevant parts of the online submodular function
maximization approach as detailed in [29]. Assume we have a list of
feasible control actions $\mathcal{A}$, a sequence of tasks $f_1, \ldots, f_T$,
and a list of actions of length $N$ that we maintain and present for each task.
One of the key components of this approach is the idea of an experts algorithm.
In this approach, the order of the selected list for each task is chosen by $N$
experts algorithms, each of which gives a piece of advice for its assigned slot
in the list. The algorithm runs $N$ distinct copies of this experts algorithm,
$\mathcal{E}_1, \mathcal{E}_2, \ldots, \mathcal{E}_N$, where each copy
$\mathcal{E}_i$ maintains a distribution over the set of possible experts (in
this case action choices). Just after task $f_t$ arrives and before the correct
sequence of actions to take for this task is shown, each experts algorithm
$\mathcal{E}_i$ selects a control action $a_i^t$. The list order used on
task $f_t$ is then $S_t = \{a_1^t, a_2^t, \ldots, a_N^t\}$. At the end of step
$t$, the value of the reward $x_i^t$ for each expert $i$ is made public and is
used to update each expert accordingly.

When it is possible to evaluate the marginal reward for each expert
(action/control primitive) for every slot, randomized weighted majority
(WMR) [22] may serve as the experts algorithm subroutine, as it needs
full-information feedback. In the trajectory selection case one would have to
evaluate all the feasible trajectories for the robot. This is an option in the
offline case when time is not a constraint, but it is not feasible online. It is
hence desirable to use an experts algorithm subroutine which requires the
marginal rewards of only those actions which are chosen to populate the
sequence (EXP3 [2]).


4  Application:  Mobile  robot  navigation 

Traditionally, path planning for mobile robot navigation is done in a
hierarchical manner, with a global planner at the top level driving the robot in
the general direction of the goal while a local planner makes sure that the
robot avoids obstacles while making progress towards the goal. The local planner
runs at a high frequency and at every time step evaluates a set of feasible
control trajectories on the immediate perceived environment to find the
trajectory yielding the least cost of traversal. The robot then moves, for one
time step, along the trajectory which has the least sum of cost of traversal and
cost to go to the goal from the end of the trajectory. This process is then
repeated at each time step.
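The selection rule used by the local planner can be sketched in a few lines of Python. This is a minimal illustration, assuming the two cost terms are available as callables; the names `select_trajectory`, `traversal_cost`, and `cost_to_go` are hypothetical and not taken from any planner implementation.

```python
def select_trajectory(trajectories, traversal_cost, cost_to_go):
    """Return the candidate minimizing traversal cost plus cost-to-go from its endpoint.

    `traversal_cost` and `cost_to_go` are assumed callables mapping a trajectory
    to a non-negative number; how they are computed is outside this sketch.
    """
    return min(trajectories, key=lambda tr: traversal_cost(tr) + cost_to_go(tr))
```

In a receding-horizon loop, this selection would be repeated at every planning cycle on the newly perceived environment.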

This set of feasible trajectories is usually computed offline by sampling
from a much larger (possibly infinite) set of feasible trajectories. Such
library-based model predictive approaches are widely used in
state-of-the-art systems leveraged by most DARPA Urban Challenge and Grand
Challenge teams (including the two highest placing teams for both)
[32, 23, 11, 30] as well as on sophisticated outdoor vehicles (LAGR [18],
UPI [3], Perceptor [19]) developed in the last decade. A particularly effective
method for generating such a library is to generate the set of trajectories
greedily such that the area between the trajectories is maximized [16]. As this
method runs offline, it does not adapt to changing conditions in the
environment, nor is it data-driven to perform well on the environments
encountered in practice.

Let $\mathrm{cost}(a_i)$ be the cost of traversing along trajectory $a_i$
sampled from the possible set of trajectories. Let $N$ be the budgeted number of
trajectories that can be evaluated during real-time operation. For a given set
of trajectories $\{a_1, a_2, \ldots, a_N\}$ sampled from the set of all feasible
trajectories, we define the monotone, submodular function that we maximize using
the lowest-cost path from the set of possible trajectories as
$f : \mathcal{S} \to [0, 1]$:

$$f = \frac{N_o - \min(\mathrm{cost}(a_1), \mathrm{cost}(a_2), \ldots, \mathrm{cost}(a_N))}{N_o} \qquad (1)$$

where $N_o$ is a constant normalizer, equal to the highest trajectory cost that
can be expected for a given cost map.
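As a small worked illustration of this objective, the function below computes the normalized value of a trajectory set from its costs. It is a sketch under stated assumptions: the name `library_value` is hypothetical, the empty set is assigned value 0, and costs above the normalizer are clipped so the result stays in [0, 1]; none of these conventions is specified in the text.

```python
def library_value(costs, max_cost):
    """Normalized value (max_cost - lowest cost) / max_cost of a trajectory set.

    `costs` holds the traversal cost of each trajectory in the set and `max_cost`
    plays the role of the constant normalizer. Returning 0.0 for an empty set and
    clipping at `max_cost` are assumptions of this sketch.
    """
    if not costs:
        return 0.0
    lowest = min(min(costs), max_cost)
    return (max_cost - lowest) / max_cost
```

For example, with costs 4, 2 and 6 and a normalizer of 10 the value is (10 - 2) / 10 = 0.8, and adding any further trajectory can only keep or raise that value, which is the monotone behaviour the definition relies on.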

We present the general algorithm for online selection of action sequences
in Algorithm 1. The inputs to the algorithm are the number of action
primitives $N$ which can be evaluated at runtime within the computational
constraints and $N$ copies of experts algorithms,
$\mathcal{E}_1, \mathcal{E}_2, \ldots, \mathcal{E}_N$, one for each position of
the sequence of actions desired. The experts algorithm subroutine can be either
Randomized Weighted Majority (WMR) [22] (Algorithm 3) or EXP3 [2]
(Algorithm 2). $T$ represents the number of planning steps the robot is expected
to carry out. In lines 2-5 a sequence of trajectories is sampled from the
current distribution of weights over trajectories maintained by each copy of
the experts algorithm. The function sampleActionExperts($\mathcal{E}_i$) samples
the distribution of weights over experts (trajectories) maintained by experts
algorithm copy $\mathcal{E}_i$ to fill in slot $i$ ($S_i^t$) of the sequence
without repeating trajectories selected for slots before the $i$th slot.


The sequence of trajectories $S^t$ is evaluated on the current environment
around the robot in line 6 to find the trajectory $a^*$ which has the least sum
of cost of traversal and cost to go to the goal from the end of the trajectory.
This trajectory is then traversed for the time $\Delta t$ until the next
planning cycle.

As a consequence of traveling the best trajectory $a^*$, the robot encounters
the next environment ENV (line 7). In lines 8-13 each experts algorithm's
weights over all feasible trajectories are increased if the monotone
submodular function $f$ is increased by adding trajectory $a_j^i$ at the $i$th
slot.

The function sampleActionExperts in the case of EXP3 corresponds to
executing lines 1-2 of Algorithm 2. For WMR this corresponds to executing
line 1 of Algorithm 3. Similarly the function updateWeight corresponds to
executing lines 3-6 of Algorithm 2 or lines 3-4 of Algorithm 3.

The learning rate $\epsilon$ for WMR is set to be $\frac{1}{\sqrt{T}}$ where
$T$ is the number of planning cycles, possibly infinite. For infinite or unknown
planning time this can be set to $\frac{1}{\sqrt{t}}$ where $t$ is the current
time step. Similarly the mixing parameter $\gamma$ for EXP3 is set as
$\min\left\{1, \sqrt{\frac{|\mathcal{A}| \ln |\mathcal{A}|}{(e - 1) T}}\right\}$.


Note that actions are generic and in the case of mobile robot
navigation are trajectory primitives from the control library. Later on, in
Section 5, actions are grasps that the manipulator can execute.

$T$ can possibly be infinite, as a ground robot can be run for arbitrary
amounts of time with new goals or waypoints presented to the robot
every time the current goal is achieved. Since the choice of $T$ influences the
learning rate of the approach, it is necessary to account for the possibility of
$T$ being infinite.

As mentioned in Section 3, WMR may be too computationally costly for
online applications as it requires the evaluation of every trajectory at every
point in the list, whether executed or not. EXP3, by contrast, learns more
slowly but requires as feedback only the cost of the sequence of
trajectories actually executed, and hence adds negligible overhead to the
trajectory library approach. For EXP3, line 9 would loop over only the experts
chosen at the current time step instead of all $|\mathcal{A}|$.

We  refer  to  this  sequence  optimization  algorithm  in  the  rest  of  the  paper 
as  SEQOPT. 


Algorithm 1 Algorithm for trajectory sequence selection
Require: number of trajectories $N$, experts algorithms subroutine copies
(Algorithms 2 and 3) $\mathcal{E}_1, \ldots, \mathcal{E}_N$

1:  for $t = 1$ to $T$ do
2:    for $i = 1$ to $N$ do
3:      $a_i^t$ = sampleActionExperts($\mathcal{E}_i$)
4:      $S_i^t \leftarrow a_i^t$
5:    end for
6:    $a^*$ = evaluateActionSequence(ENV, $S^t$)
7:    ENV = getNextEnvironment($a^*$, $\Delta t$)
8:    for $i = 1$ to $N$ do
9:      for $j = 1$ to $|\mathcal{A}|$ do
10:       $reward_j^i = f_t(S^t_{\langle i-1 \rangle} \oplus a_j^i) - f_t(S^t_{\langle i-1 \rangle})$
11:       $w_j^i \leftarrow$ updateWeight($reward_j^i$, $w_j^i$)
12:     end for
13:   end for
14: end for
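One planning cycle of this loop can be sketched in Python as follows. This is a simplified illustration rather than the authors' implementation: it assumes a static table of trajectory costs in place of a perceived environment, omits the cost-to-go term, uses a full-information multiplicative update for every slot, and all function and parameter names are hypothetical.

```python
import random


def trajectory_set_value(costs, indices, max_cost):
    """Normalized value of a set of trajectory indices; the empty set scores 0 (assumption)."""
    if not indices:
        return 0.0
    return (max_cost - min(costs[j] for j in indices)) / max_cost


def sample_slot(weights, taken, rng):
    """Sample an index in proportion to its weight, skipping indices already taken."""
    candidates = [j for j in range(len(weights)) if j not in taken]
    return rng.choices(candidates, weights=[weights[j] for j in candidates], k=1)[0]


def seqopt_step(slot_weights, costs, max_cost, epsilon, rng):
    """One planning cycle: sample a sequence, pick its best entry, update all slot weights."""
    sequence = []
    for weights in slot_weights:                      # one expert copy per slot
        sequence.append(sample_slot(weights, sequence, rng))
    best = min(sequence, key=lambda j: costs[j])      # lowest-cost sampled trajectory
    for i, weights in enumerate(slot_weights):        # marginal-reward updates per slot
        prefix = sequence[:i]
        base = trajectory_set_value(costs, prefix, max_cost)
        for j in range(len(weights)):
            reward = trajectory_set_value(costs, prefix + [j], max_cost) - base
            weights[j] *= (1.0 + epsilon) ** reward   # rewards are non-negative here
    return sequence, best
```

Calling `seqopt_step` repeatedly with the same `slot_weights` carries the learned weights across cycles; in this toy setting the first slot concentrates its weight on the lowest-cost trajectory.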


SEQOPT: the approach detailed here and inherited from [29] is an
online algorithm which produces a sequence that converges to the greedy
sequence as the time horizon grows. The greedy sequence is guaranteed
to achieve at least $1 - 1/e$ of the value of the optimal list [11]. Therefore
SEQOPT is a zero α-regret (for $\alpha = 1 - 1/e$ here) algorithm. This implies
that its α-regret goes to 0 at a rate of $O(1/\sqrt{T})$ for $T$ interactions
with the environment.
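For reference, the offline greedy sequence mentioned here can be built by repeatedly appending the action with the largest marginal gain. The sketch below illustrates that generic greedy construction on a hypothetical coverage objective; the names `greedy_sequence`, `coverage`, and `TOY_COVERS` are assumptions for illustration only.

```python
def greedy_sequence(actions, f, length):
    """Build a sequence by repeatedly appending the action with the largest marginal gain."""
    chosen = []
    remaining = list(actions)
    while remaining and len(chosen) < length:
        base = f(chosen)
        best = max(remaining, key=lambda a: f(chosen + [a]) - base)
        chosen.append(best)
        remaining.remove(best)
    return chosen


# Hypothetical toy objective: number of universe elements covered by a sequence.
TOY_COVERS = {"a": {0, 1, 2}, "b": {2, 3}, "c": {0}}


def coverage(sequence):
    covered = set()
    for action in sequence:
        covered |= TOY_COVERS[action]
    return len(covered)
```

Here `greedy_sequence(["c", "b", "a"], coverage, 2)` first picks "a" (gain 3) and then "b" (gain 1), regardless of the order in which the candidates are listed.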

We  are  also  interested  in  its  performance  with  respect  to  future  data  and 
hence  consider  notions  of  near-optimality  with  respect  to  distributions  of 
Algorithm 2 Experts Algorithm: Exponential-weight algorithm for Exploration
and Exploitation (EXP3) [2]

Require: $\gamma \in (0, 1]$, initialization $w_j = 1$ for $j = 1, \ldots, |\mathcal{A}|$

1:  Set $p_j = (1 - \gamma) \frac{w_j}{\sum_{k=1}^{|\mathcal{A}|} w_k} + \frac{\gamma}{|\mathcal{A}|}$ for $j = 1, \ldots, |\mathcal{A}|$
2:  Randomly sample $i$ according to the probabilities $p_1, \ldots, p_{|\mathcal{A}|}$
3:  Receive $reward_i \in [0, 1]$
4:  for $j = 1$ to $|\mathcal{A}|$ do
5:    $\hat{x}_j = reward_j / p_j$ if $j = i$, and $\hat{x}_j = 0$ otherwise
6:    $w_j \leftarrow w_j \exp(\gamma \hat{x}_j / |\mathcal{A}|)$
7:  end for
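A compact Python rendering of the standard EXP3 update is given below as a sketch. The class name `Exp3`, its method names, and the injected random generator are assumptions made for illustration; only the weight and probability formulas follow the usual statement of the algorithm.

```python
import math
import random


class Exp3:
    """Minimal EXP3 bandit learner over `num_actions` arms (illustrative sketch)."""

    def __init__(self, num_actions, gamma, rng=None):
        if not 0.0 < gamma <= 1.0:
            raise ValueError("gamma must lie in (0, 1]")
        self.k = num_actions
        self.gamma = gamma
        self.weights = [1.0] * num_actions
        self.rng = rng or random.Random()

    def probabilities(self):
        """Mix the normalized weights with a uniform exploration term."""
        total = sum(self.weights)
        return [(1.0 - self.gamma) * w / total + self.gamma / self.k
                for w in self.weights]

    def sample(self):
        """Draw one arm according to the current probabilities."""
        return self.rng.choices(range(self.k), weights=self.probabilities(), k=1)[0]

    def update(self, chosen, reward):
        """Update only the chosen arm; the estimated reward of every other arm is zero."""
        estimate = reward / self.probabilities()[chosen]
        self.weights[chosen] *= math.exp(self.gamma * estimate / self.k)
```

Because the importance-weighted estimate is zero for every arm that was not played, only the chosen arm's weight changes, which is why the feedback requirement is limited to the actions actually selected.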


Algorithm 3 Experts Algorithm: Randomized Weighted Majority (WMR) [22]

Require: Initialization $w_j = 1$ for $j = 1, \ldots, |\mathcal{A}|$

1:  Randomly sample $j$ according to the distribution of weights $w_1, \ldots, w_{|\mathcal{A}|}$
2:  Receive rewards for all experts $reward_1, \ldots, reward_{|\mathcal{A}|}$
3:  for $j = 1$ to $|\mathcal{A}|$ do
4:    $w_j \leftarrow w_j (1 + \epsilon)^{reward_j}$ if $reward_j \ge 0$;
      $w_j \leftarrow w_j (1 - \epsilon)^{-reward_j}$ if $reward_j < 0$
5:  end for
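The multiplicative update in the loop above can be written as a short Python function. This is a sketch with hypothetical names (`wmr_update`, `epsilon`), assuming rewards are supplied for every expert as the full-information setting requires.

```python
def wmr_update(weights, rewards, epsilon):
    """Return new weights after one full-information multiplicative update.

    Non-negative rewards scale a weight up by (1 + epsilon) ** reward;
    negative rewards scale it down by (1 - epsilon) ** (-reward).
    """
    updated = []
    for w, r in zip(weights, rewards):
        if r >= 0:
            updated.append(w * (1.0 + epsilon) ** r)
        else:
            updated.append(w * (1.0 - epsilon) ** (-r))
    return updated
```

For instance, with both weights at 1, rewards of +1 and -1, and epsilon 0.5, the weights become 1.5 and 0.5.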
Figure 1: The cost of traversal of the robot on the cost map of Fort Hood, TX using
trajectory sequences generated by different methods for 30 trajectories per time
step over 1055788 planning cycles in 4396 runs. Constant curvature trajectories
result in the highest cost of traversal followed by Green-Kelly path sets. Our
sequence optimization approach (SEQOPT) using EXP3 as the experts algorithm
subroutine results in the lowest cost of traversal (8% lower than Green-Kelly) with
negligible overhead.


environments. We define a statically optimal sequence of trajectories
$S_{so} \in \mathcal{S}$ as:

$$S_{so} = \arg\max_{S} E_{d(ENV)}[f(ENV, S)] \qquad (2)$$

where $d(ENV)$ is a distribution of environments that are randomly
sampled. The trajectory sequence $S$ is evaluated at each location. A statically
near-optimal trajectory sequence $S_{so}$ thus approximately maximizes the
expectation in Equation 2 ($E_{d(ENV)}[f(ENV, S)]$) over the distribution of
environments ENV, effectively optimizing the one-step cost of traversal at
the locations sampled from the distribution of the environments.
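The expectation in this definition can be approximated by averaging over sampled environments. The sketch below assumes each environment is represented simply as a mapping from trajectory name to traversal cost, and that a small list of candidate sequences is compared exhaustively; both are simplifications introduced here, and the names `expected_value` and `best_static_sequence` are hypothetical.

```python
def expected_value(sequence, environments, max_cost):
    """Monte Carlo estimate of the expected normalized value of `sequence`.

    Each environment is assumed to be a dict mapping trajectory name to cost.
    """
    total = 0.0
    for costs in environments:
        lowest = min(costs[a] for a in sequence)
        total += (max_cost - lowest) / max_cost
    return total / len(environments)


def best_static_sequence(candidates, environments, max_cost):
    """Pick the candidate sequence with the highest estimated expected value."""
    return max(candidates, key=lambda s: expected_value(s, environments, max_cost))
```

Exhaustive comparison is only workable for a handful of candidates; it is shown here to make the objective concrete, not as a substitute for the online procedure.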

Knepper et al. [20] note that sequences of trajectories are generally
designed for this kind of static planning paradigm but are used in a dynamic
planning paradigm where the library choice influences the examples seen



Figure 2: Panels: (a) Constant Curvature, (b) Constant Curvature Density,
(c) Green-Kelly, (d) Green-Kelly Density, (e) SEQOPT (EXP3) Dynamic,
(f) SEQOPT (EXP3) Dynamic Density, (g) SEQOPT (EXP3) Static,
(h) SEQOPT (EXP3) Static Density. The density of distribution of trajectories
learned by our approach (SEQOPT using EXP3) for the dynamic planning paradigm
in Figure 2e shows that most of the trajectories are distributed in the front,
whereas for the static paradigm they are more spread out to the side. This shows
that for the dynamic case more trajectories should be put in the front of the
robot, as obstacles are more likely to occur to the side, as pointed out by
Knepper et al. [20].
Figure 3: Panels: (a) Cost map of Fort Hood, Texas, (b) Robot simulated driving
over cost map towards goal. We used a real-world cost map of Fort Hood, Texas
(3a) and simulated a robot driving over the map to random goal locations from
random start locations (3b) using trajectory sequences generated by different
methods for comparison. The trajectory in green is evaluated to be the least
cost for the pictured planning cycle.


and that there is little correlation between good static and
good dynamic performance for a sequence. Our approach bridges this gap
by allowing offline batch training on a fixed distribution, or allowing
samples to be generated by running the currently sampled library.

We then define a weakly dynamically optimal trajectory sequence
$S_{wdo} \in \mathcal{S}$ as:

$$S_{wdo} = \arg\max_{S} E_{d(ENV|\pi)}[f(ENV, S)] \qquad (3)$$

where $d(ENV|\pi)$ is defined as the distribution of environments that are
induced by the robot following the policy $\pi$. The policy $\pi$ corresponds to
the robot following the least cost trajectory within $S_{wdo}$ at each situation
encountered. Hence a weakly dynamically optimal trajectory sequence minimizes
the cost of traversal of the robot at all the locations which the robot
encounters as a consequence of executing the policy $\pi$. We define this as
weakly dynamically optimal as there can be other trajectory sequences
$S \in \mathcal{S}$ that can minimize the cost of traversal with respect to the
distribution of environments induced by following the policy $\pi$.

Knepper et al. [20] further note the surprising fact that for a vehicle
following a reasonable policy, averaged over time steps, the distribution of
obstacles encountered ends up heavily weighted to the sides. Good earlier
policy choices imply that the space to the immediate front of the robot is mostly
devoid of obstacles. It is effectively a chicken-egg problem to find such a
Figure 4: As the number of trajectories evaluated per planning cycle is increased,
the cost of traversal for trajectory sequences generated by Green-Kelly and our
method drops, and at 80-100 trajectories they achieve almost the same cost of
traversal. It is to be noted that our approach decreases the cost of traversal
much faster than Green-Kelly trajectory sequences.


policy with respect to its own induced distribution of examples, which we
address here as weak dynamic optimality.

We briefly note the following propositions about the statistical
performance of Algorithm 1. We defer full proofs to the appendix, but note that
they follow from recent results of online-to-batch learning [25] combined
with the regret guarantees of [29] on the objective functions we present.

Proposition 1. (Approximate Static Optimality) If getNextEnvironment
returns independent examples from a distribution over environments (i.e.,
the chosen control does not affect the next sample), then for a list $S$ chosen
randomly from those generated throughout the $T$ iterations of Algorithm 1,
it holds that
$E_{d(ENV)}[(1 - 1/e) f(S^*) - f(S)] \le O\left(\sqrt{\frac{\ln(1/\delta)}{T}}\right)$
with probability greater than $1 - \delta$.

Proposition 2. (Approximate Weak Dynamic Optimality) If
getNextEnvironment returns examples by forward simulating beginning with a random
environment and randomly choosing a new environment on reaching a goal,
then consider the policy $\pi_{mixture}$ that begins each new trial by choosing
a list randomly from those generated throughout the $T$ iterations of
Algorithm 1. By the no-regret property, such a mixture policy will be α
approximately dynamically optimal in expectation up to an additive term
$O\left(\sqrt{\frac{\ln(1/\delta)}{T}}\right)$ with probability greater than
$1 - \delta$. Further, in the (experimentally typical) case where the
distribution over library sequences converges, the resulting single list is (up
to approximation factor α) weakly dynamically optimal.


4.1  Experimental  setup 


We simulated a robot driving over a real-world cost map generated for Fort
Hood, Texas (Figure 3) with trajectory sequences generated by using the
method devised by Green et al. [16] for both constant curvature arcs
(Figures 2a, 2b) and trajectories comprised of concatenation of arcs of
different curvatures (Figures 2c, 2d). The cost map and parameters for the local
planner (number of trajectories to evaluate per time step, length of the
trajectories, fraction of trajectory traversed per time step) were taken to most
closely match those given in Bagnell et al. [3].


4.2  Results 

4.2.1  Dynamic  Simulation 

Figure 1 shows the cost of traversal of the robot with different trajectory sets
as a function of number of runs. Each run constitutes the robot starting from
a random starting location and ending at the specified goal on the map. 100
goal locations and 50 start locations for every goal location were chosen at
random. The set of weights for the $N$ copies of experts algorithm EXP3 were
carried over through consecutive runs.

The cost of traversal of constant curvature trajectory sequences grows at
the highest rate, followed by the Green-Kelly path set. The lowest cost
of traversal is achieved by running Algorithm 1 with EXP3 as the experts
algorithm subroutine. At the end of 4396 runs there is an 8% reduction in cost
of traversal between Green-Kelly and our approach (SEQOPT using EXP3).
It is to be emphasized that this improvement in path planning is obtained with
negligible overhead. Though the complexity of our approach scales linearly
in the number of motion primitives and depth of the library, each operation
is simply a multiplicative update and a sampling step. In practice it was
not possible to evaluate even a single extra motion primitive in the time
overhead that our approach requires. In 1 millisecond, 100000 update steps
can be performed using EXP3 as the experts algorithm subroutine.

4.2.2  Static  Simulation 

In addition to the dynamic simulations we also performed a static
simulation where, for 100 goal locations, the robot was placed at 500 random
poses in the cost map and the cost of traversal of the selected trajectory $a^*$
over the next planning cycle was recorded. SEQOPT with EXP3 and Green-Kelly
sequences obtained 0.5% and 0.25% lower cost of traversal than constant
curvature sequences, respectively. The performance for all three methods
was essentially at par. This can be explained by the fact that Green-Kelly
trajectory sequences are essentially designed to handle the static case of
planning, where trajectories must provide adequate density of coverage in
all directions as the distribution of obstacles is entirely unpredictable in this
case.

In the dynamic planning case, on the other hand, the situations the robot
encounters are highly correlated, and because the robot is likely to be guided
by a global trajectory, a local planner that tracks that trajectory well will
likely benefit from a higher density of trajectories toward the front, as most
of the obstacles will be to the sides of the path. This is evident from the
densities of generated trajectory sequences for each case as shown in Figure 2.
Our approach naturally deals with this divide between the static and
dynamic planning paradigms by adapting the chosen trajectory sequence at all
times. A video demonstration of the algorithm can be found at the following

link:  [Q 


5 Application: Grasp selection for manipulation

Most of the past work on grasp set generation and selection has focused on
automatically producing a successful and stable grasp for a novel object, with
computational time a secondary concern. As a result very few grasp
selection algorithms have attempted to optimize the order of consideration
in grasp databases. Berenson et al. [4] dynamically ranked pre-computed
grasps by calculating a grasp-score based on force closure, robot position,
and environmental clearance. Ratliff et al. [24] employed imitation learning
on demonstrated example grasps to select a grasp in a discretized grasp
space. In both of these cases the entire library of grasps is evaluated for
each new environment or object at run time, and the order of the entries and
their effect on computation are not considered. In this section we describe
our grasp ranking procedure, which uses trajectory generation success to
reorder a list of database grasps, so that for a majority of situations
encountered in a new environment, only a subset of grasp entries near the front
of the control library needs to be evaluated.

For a sequence of grasps $S \in \mathcal{S}$ we define the submodular monotone
grasp selection function $f : \mathcal{S} \to [0, 1]$ as $f = P(S)$, where $P(S)$ is the
probability of successfully grasping an object in a given scenario using the
sequence of grasps provided.

For a given sequence of grasps $S \in \mathcal{S}$ we want to minimize the cost of
evaluating the sequence, i.e. minimize the depth in the list that has to be
searched until a successful grasp is found. Thus the cost of a sequence of
grasps can be defined as $c = \sum_{i=0}^{N} \left(1 - f(S_{\langle i \rangle})\right)$, where $f(S_{\langle i \rangle})$ is defined as the
value of the submodular function $f$ on executing sequence $S \in \mathcal{S}$ up to $\langle i \rangle$
slots in the sequence. Minimizing $c$ corresponds to minimizing the depth $i$
in the sequence of grasps that must be evaluated for a successful grasp to be
found. (We assume that every grasp takes equal time to evaluate.)
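
As a minimal sketch of this cost, assume that the outcome of every grasp in every environment is available as a boolean table, and that the cost is the sum over prefix lengths of the probability that no grasp in the prefix has succeeded. The table `success` and the helper names below are hypothetical and are not taken from the paper; $f$ is estimated empirically as the fraction of environments solved by a prefix.

```python
from typing import List, Sequence


def prefix_success(success: List[List[bool]], order: Sequence[int], i: int) -> float:
    """Empirical f(S_<i>): fraction of environments solved by the first i grasps."""
    solved = sum(any(env[g] for g in order[:i]) for env in success)
    return solved / len(success)


def sequence_cost(success: List[List[bool]], order: Sequence[int]) -> float:
    """c = sum over i = 0..N of (1 - f(S_<i>)), with N = len(order)."""
    return sum(1.0 - prefix_success(success, order, i) for i in range(len(order) + 1))


# Hypothetical data: rows are environments, columns are grasps 0..2.
success = [
    [True, False, False],
    [True, False, False],
    [False, False, True],
    [False, True, True],
]
cost_a = sequence_cost(success, [0, 2, 1])  # complementary grasps first
cost_b = sequence_cost(success, [1, 0, 2])  # rarely useful grasp first
```

When every environment is eventually solved by the sequence, this sum equals the mean search depth, so a lower `cost_a` than `cost_b` means fewer grasp evaluations on average.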

The same algorithm used for trajectory sequence generation (Algorithm 1) is
used here for grasp sequence generation. Here the set of experts for each
experts algorithm is the set of grasps in the grasp library, and each experts
algorithm $S_i$ maintains a weight for each grasp (expert) in the library.
A sequence of grasps is constructed by sampling, without repetition, from the
distribution of weights over grasps maintained by $S_i$ for each position $i$ in
the sequence (lines 1-5). This sequence is evaluated on the current environment
until a successful grasp a* is found (line 6). Note that the next environment is
then presented to the robot from a distribution of environments and, unlike
in the path planning case, is not obtained by executing the successful grasp. If
the successful grasp was found at position $i$ in the sequence, then in experts
algorithm $S_i$ the weight corresponding to the successful grasp id is updated
using SEQOPT with EXP3's update rule. For WMR all the grasps in the
sequence are evaluated and the weights for every expert are updated according
to lines 9-12.
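
The procedure above can be sketched as follows. This is an illustration under stated assumptions, not a reproduction of Algorithm 1: the class `Exp3Slot` and the functions `sample_sequence` and `train_step` are hypothetical names, the update is the standard EXP3 rule of Auer et al. [2] with an assumed exploration rate `gamma`, and using the renormalized sampling probability as the importance weight is a choice made here for simplicity. The full-information WMR variant is not shown.

```python
import math
import random
from typing import Callable, List, Optional, Tuple


class Exp3Slot:
    """One EXP3 learner per sequence position (hypothetical helper)."""

    def __init__(self, n_experts: int, gamma: float = 0.1) -> None:
        self.w = [1.0] * n_experts
        self.gamma = gamma

    def probs(self) -> List[float]:
        total, k = sum(self.w), len(self.w)
        return [(1.0 - self.gamma) * wj / total + self.gamma / k for wj in self.w]

    def update(self, expert: int, reward: float, p: float) -> None:
        # Importance-weighted multiplicative update of the chosen expert only.
        self.w[expert] *= math.exp(self.gamma * (reward / p) / len(self.w))
        top = max(self.w)  # rescale to avoid overflow; probabilities are unchanged
        self.w = [wj / top for wj in self.w]


def sample_sequence(slots: List[Exp3Slot], rng: random.Random) -> Tuple[List[int], List[float]]:
    """Draw one grasp per slot, skipping grasps already placed in the sequence."""
    seq: List[int] = []
    seq_p: List[float] = []
    for slot in slots:
        p = slot.probs()
        candidates = [g for g in range(len(p)) if g not in seq]
        mass = sum(p[g] for g in candidates)
        g = rng.choices(candidates, weights=[p[c] for c in candidates])[0]
        seq.append(g)
        seq_p.append(p[g] / mass)  # probability of g under the renormalized draw
    return seq, seq_p


def train_step(slots: List[Exp3Slot], succeeds: Callable[[int], bool],
               rng: random.Random) -> Optional[int]:
    """Evaluate a sampled sequence until the first success; return the search depth."""
    seq, seq_p = sample_sequence(slots, rng)
    for i, g in enumerate(seq):
        if succeeds(g):
            slots[i].update(g, 1.0, seq_p[i])
            return i + 1
    return None  # no grasp in the sequence succeeded


# Toy run: three grasps, three slots, and only grasp 2 ever succeeds.
rng = random.Random(0)
slots = [Exp3Slot(n_experts=3) for _ in range(3)]
depths = [train_step(slots, lambda g: g == 2, rng) for _ in range(300)]
```

In this toy setting only the weight of grasp 2 is ever increased in the first slot, so the first learner concentrates its probability on the grasp that actually succeeds.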

5.1  Experimental  setup 

We performed experiments using environments containing a trigger-style
flashlight as the target object. We used the OpenRAVE [9] simulation framework
to generate a multitude of different grasps and environments for each
object. The manipulator used in this experiment is a Barrett WAM arm and
hand with a fixed base, and a 3D joystick is used to control the simulated
robot in grasp sequence generation. Since the grasps are generated by a human
operator, we assume they are stable grasps and hence the main failure
mode is in trajectory planning and obstacle collision. During both training
and testing, bidirectional RRT [21] is used to generate the trajectory from
the manipulator's current position to the target grasp position.

Figure 5: Example grasps from the grasp library sequence. Each grasp has a
different approach direction and finger joint configuration recorded with respect
to the object's frame of reference. Our algorithm attempts to re-order the grasp
sequence to quickly cover the space of possible scenarios with fewer grasps at the
front of the sequence.

The  grasp  library  consisted  of  60  grasps  and  the  library  was  evaluated 
on  50  different  environments  for  training,  and  50  environments  for  testing. 
For  a  particular  environment/grasp  pair  the  grasp  success  is  evaluated  by  the 
success  of  Bi-RRT  trajectory  generation,  and  the  grasp  sequence  ordering  is 
updated  at  each  timestep  of  training.  For  testing  and  during  run-time,  the 
learned  list  was  evaluated  without  further  feedback. 

We compared the performance of SEQOPT with EXP3 as well as WMR
as expert subroutines to two methods of grasp library ordering: a random
grasp ordering, and an ordering of the grasps by decreasing success rate
across all examples in training (which we will call "frequency"). At each
time step of the training process, a random environment was selected from
the training set and each of the four grasp sequence orderings was evaluated.
The search depth for each test case was tracked to compute overall
performance. The performance of the two naive ordering methods does not
improve over time because the frequency method is a single static sequence
and the random method has a uniform distribution over all possible rankings.

Figure 6: Executing a particular grasp in both simulation and real hardware. The
grasp library ordering is trained in simulation only, and the resulting grasp
sequence can be executed in hardware without modifications.

5.2  Results 

The performance of each sequence after training is shown in Figure 7. We
can clearly see a dramatic improvement in the performance of SEQOPT run
with both WMR and EXP3 update rules over the random and frequency
methods. While the random and frequency methods produce a grasp sequence
ordering that requires an average of about 7 evaluations before a successful
grasp is found, SEQOPT with WMR and EXP3 produces a more optimized
ordering that requires only about 5 evaluations, which is a 29% improvement.
Since evaluating a grasp entails planning to the goal and executing the actual
grasp, this improvement is a significant speedup in finding a successful grasp.
Again this improvement comes at negligible cost: in practice it was not
possible to evaluate a single extra grasp in the extra time overhead for our
approach.

It is interesting to note that a random ordering of the grasps has similar
performance to the frequency method. This is because similar grasps
tend to be correlated in their success and failure, so the grasps at the front
of the frequency ordering tend to be similar. When the first grasp fails, the
next few are likely to fail as well, increasing the average search depth. The
SEQOPT algorithm solves this correlation problem by ordering the grasp
sequence such that the grasps near the front of the library cover the space of
possible configurations as quickly as possible. A video demonstration of the
algorithm can be found at the following link: [1].
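
The correlation effect can be illustrated with a small hypothetical success table (invented for illustration, not data from the experiments). The sketch compares the frequency ordering with a batch greedy coverage ordering, which is used here only as a simple stand-in for the ordering that the online algorithm is expected to approach; all function names are assumptions.

```python
from typing import List


def mean_depth(success: List[List[bool]], order: List[int]) -> float:
    """Average number of grasps evaluated until the first success."""
    depths = [
        next((i + 1 for i, g in enumerate(order) if env[g]), len(order) + 1)
        for env in success
    ]
    return sum(depths) / len(depths)


def frequency_order(success: List[List[bool]]) -> List[int]:
    """Sort grasps by decreasing overall success count."""
    n_grasps = len(success[0])
    return sorted(range(n_grasps), key=lambda g: -sum(env[g] for env in success))


def greedy_coverage_order(success: List[List[bool]]) -> List[int]:
    """Repeatedly pick the grasp that solves the most not-yet-solved environments."""
    n_grasps = len(success[0])
    order: List[int] = []
    covered = set()
    while len(order) < n_grasps:
        best = max(
            (g for g in range(n_grasps) if g not in order),
            key=lambda g: sum(1 for e, env in enumerate(success) if env[g] and e not in covered),
        )
        order.append(best)
        covered |= {e for e, env in enumerate(success) if env[best]}
    return order


# Grasps 0 and 1 are similar (they succeed in nearly the same environments);
# grasp 2 is rarely successful but covers the environments the others miss.
success = [
    [True, True, False],
    [True, True, False],
    [True, True, False],
    [True, False, False],
    [False, False, True],
    [False, False, True],
]
freq = frequency_order(success)
greedy = greedy_coverage_order(success)
depth_freq = mean_depth(success, freq)
depth_greedy = mean_depth(success, greedy)
```

The frequency ordering places the two correlated grasps first, so the environments they both miss are only solved at depth 3, while the coverage ordering reaches them at depth 2.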


[Figure 7 plot. Axis: Average Search Depth. Methods: Freq, Rand, SeqOpt (EXP3), SeqOpt (WMR).]

Figure 7: Average depth till successful grasp for flashlight object with 50 test
environments. The training data shows the average search depth achieved at the
end of the training session over 50 training environments. Algorithm 1 (SEQOPT)
when run with EXP3 as the experts algorithm subroutine achieves 20% reduction
over grasp sequences arranged by average rate of success (Freq.) or a random
ordering of the grasp list (Rand.).


6  Conclusion 

We have shown an efficient method for optimizing performance of control
libraries and have attempted to answer the question of how to construct and
order such libraries.

The grasp sequence generation method presented here does not incorporate
the context in which the object is placed in the current environment. We
aim to modify the current approach to close the loop with perception and
take account of features in the environment for grasp sequence generation.

As robots employ increasingly large control libraries to deal with the
diversity and complexity of environments that they may encounter, approaches
such as the ones presented here will become crucial to maintaining robust
real-time operation.


7  Acknowledgements 

This  work  was  funded  by  the  Army  Research  Laboratories  through  the  Robotics 
Collaborative  Technology  Alliance  (R-CTA)  and  the  Defense  Advanced  Re¬ 
search  Projects  Agency  through  the  Autonomous  Robotic  Manipulation  Soft¬ 
ware  Track  (ARM-S). 


References 

[1] URL http://youtube.com/robotcontroll

[2] P. Auer, N. Cesa-Bianchi, Y. Freund, and R.E. Schapire. The nonstochastic
multiarmed bandit problem. SIAM Journal on Computing,
32(1):48-77, 2003.

[3] J.A. Bagnell, D. Bradley, D. Silver, B. Sofman, and A. Stentz. Learning
for autonomous navigation. Robotics Automation Magazine, IEEE,
17(2):74-84, June 2010.

[4] D. Berenson, R. Diankov, K. Nishiwaki, S. Kagami, and J. Kuffner.
Grasp planning in complex scenes. In IEEE-RAS Humanoids, December
2007.

[5] M.S. Branicky, R.A. Knepper, and J.J. Kuffner. Path and trajectory
diversity: Theory and algorithms. In ICRA, pages 1359-1364. IEEE,
2008.

[6] N. Cesa-Bianchi, A. Conconi, and C. Gentile. On the generalization
ability of on-line learning algorithms. Information Theory, IEEE
Transactions on, 50(9):2050-2057, 2004.




[7] E. Chinellato, R.B. Fisher, A. Morales, and A.P. del Pobil. Ranking
planar grasp configurations for a three-finger hand. In ICRA, volume 1,
pages 1133-1138. IEEE, 2003.

[8] M.T. Ciocarlie and P.K. Allen. On-line interactive dexterous grasping.
In EuroHaptics, page 104, 2008.

[9] R. Diankov. Automated Construction of Robotic Manipulation Programs.
PhD thesis, Carnegie Mellon University, Robotics Institute,
August 2010.

[10] L.H. Erickson and S.M. LaValle. Survivability: Measuring and ensuring
path diversity. In ICRA, pages 2068-2073. IEEE, 2009.

[11] U. Feige. A threshold of ln n for approximating set cover. JACM,
45(4):634-652, 1998.

[12] U. Feige, L. Lovász, and P. Tetali. Approximating min sum set cover.
Algorithmica, 40(4):219-234, 2004.

[13] E. Frazzoli, M.A. Dahleh, and E. Feron. Robust hybrid control for
autonomous vehicle motion planning. In Decision and Control, 2000.,
volume 1, 2000.

[14] S. Fujishige. Submodular functions and optimization. Elsevier Science
Ltd, 2005.

[15] C. Goldfeder, M. Ciocarlie, J. Peretzman, H. Dang, and P.K. Allen.
Data-driven grasping with partial sensor data. In IROS, pages
1278-1283. IEEE, 2009.

[16] C. Green and A. Kelly. Optimal sampling in the space of paths: Preliminary
results. Technical Report CMU-RI-TR-06-51, Robotics Institute,
Pittsburgh, PA, November 2006.

[17] T. Howard, C. Green, and A. Kelly. State space sampling of feasible
motions for high performance mobile robot navigation in highly
constrained environments. In FSR, July 2007.

[18] L.D. Jackel et al. The DARPA LAGR program: Goals, challenges,
methodology, and phase I results. JFR, 23(11-12):945-973, 2006.

[19] Alonzo Kelly et al. Toward reliable off road autonomous vehicles
operating in challenging environments. IJRR, 25(1):449-483, May 2006.




[20] R. Knepper and M.T. Mason. Path diversity is only part of the problem.
In ICRA, May 2009.

[21] J.J. Kuffner Jr. and S.M. LaValle. RRT-Connect: An efficient approach
to single-query path planning. In ICRA, volume 2, pages 995-1001,
2000.

[22] N. Littlestone and M.K. Warmuth. The Weighted Majority Algorithm.
Information and Computation, 108:212-261, 1994.

[23] M. Montemerlo et al. Junior: The Stanford entry in the urban challenge.
JFR, 25(9):569-597, 2008.

[24] N. Ratliff, J.A. Bagnell, and S. Srinivasa. Imitation learning for
locomotion and manipulation. Technical Report CMU-RI-TR-07-45,
Robotics Institute, Pittsburgh, PA, December 2007.

[25] S. Ross, G.J. Gordon, and J.A. Bagnell. No-Regret Reductions
for Imitation Learning and Structured Prediction. Arxiv preprint
arXiv:1011.0686, 2010.

[26] J.P. Saut and D. Sidobre. Efficient Models for Grasp Planning With
A Multi-fingered Hand. In Workshop on Grasp Planning and Task
Learning by Imitation, volume 2010, 1918.

[27] B. Sofman, J.A. Bagnell, and A. Stentz. Anytime online novelty
detection for vehicle safeguarding. In ICRA, May 2010.

[28] M. Stolle and C.G. Atkeson. Policies based on trajectory libraries. In
ICRA, pages 3344-3349. IEEE, 2006.

[29] M. Streeter and D. Golovin. An online algorithm for maximizing
submodular functions. In NIPS, pages 1577-1584, 2008.

[30] Sebastian Thrun et al. Stanley: The robot that won the DARPA grand
challenge: Research articles. J. Robot. Syst., 23:661-692, September
2006.

[31] Christopher Urmson et al. A robust approach to high-speed navigation
for unrehearsed desert terrain. JFR, 23(1):467-508, August 2006.

[32] Christopher Urmson et al. Autonomous driving in urban environments:
Boss and the urban challenge. JFR, 25(1):425-466, June 2008.




A  Proof  of  monotone  submodularity 

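A sequence function $f$ maps subsequences $\mathcal{A} \subseteq \mathcal{V}$ of a finite
sequence $\mathcal{V}$ to the real numbers. $f$ is called submodular if, for all
$\mathcal{A} \subseteq \mathcal{B} \subseteq \mathcal{V}$ and $\mathcal{S} \in \mathcal{V} \setminus \mathcal{B}$, it holds that

$$f(\mathcal{A} \oplus \mathcal{S}) - f(\mathcal{A}) \geq f(\mathcal{B} \oplus \mathcal{S}) - f(\mathcal{B}) \tag{21}$$

where $\oplus$ is the concatenation operator. Such a function is monotone if,
for any sequences $\mathcal{A}_1, \mathcal{A}_2 \in \mathcal{V}$, we have

$$f(\mathcal{A}_1) \leq f(\mathcal{A}_1 \oplus \mathcal{A}_2), \qquad f(\mathcal{A}_2) \leq f(\mathcal{A}_1 \oplus \mathcal{A}_2) \tag{22}$$

For the trajectory selection case we want to prove that $f$ (in equation 1,
restated here) is monotone, submodular:

$$f = \frac{N_0 - \min_{a_i \in \mathcal{A}}(\mathrm{cost}(a_i))}{N_0} \tag{23}$$

where $\mathcal{A}$ is the set of all feasible actions or control sequences. This can
be proved if $\min_{a_i \in \mathcal{A}}(\mathrm{cost}(a_i))$ is monotone, supermodular. A function $f$ is
supermodular if it holds that

$$f(\mathcal{A} \oplus \mathcal{S}) - f(\mathcal{A}) \leq f(\mathcal{B} \oplus \mathcal{S}) - f(\mathcal{B}) \tag{24}$$

Theorem 1. The function $\min_{a_i \in \mathcal{A}}(\mathrm{cost}(a_i))$ is monotone, supermodular, where
$a_i$ are trajectories sampled from the set of all feasible trajectories.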

Proof. Submodularity

Assume that we are given sequences $\mathcal{A} \subseteq \mathcal{B} \subseteq \mathcal{V}$ and $\mathcal{S} \in \mathcal{V} \setminus \mathcal{B}$. We
want to prove the inequality in equation 24. Let $\mathcal{R} = \mathcal{B} \setminus \mathcal{A}$, the set of
elements that are in $\mathcal{B}$ but not in $\mathcal{A}$. Since $\mathcal{A} \oplus \mathcal{R} = \mathcal{B}$ we can now rewrite
equation 24 as

$$f(\mathcal{A} \oplus \mathcal{S}) - f(\mathcal{A}) \leq f(\mathcal{A} \oplus \mathcal{R} \oplus \mathcal{S}) - f(\mathcal{A} \oplus \mathcal{R}) \tag{25}$$

We refer to the left and right sides of equation 25 as LHS and RHS respectively.
Define a* as the trajectory which has the least cost when evaluated
on a given environment. Hence there can be three cases:

• Case 1: $a^* \in \mathcal{A}$. In this case LHS = RHS = 0.

• Case 2: $a^* \in \mathcal{R}$. In this case RHS $\geq$ LHS.

• Case 3: $a^* \in \mathcal{S}$. In this case RHS $\geq$ LHS.

Since in all possible cases it can be seen that RHS is greater than or
equal to LHS, it is proved that $\min_{a_i \in \mathcal{A}}(\mathrm{cost}(a_i))$ is supermodular. Note that if
there  are  multiple  trajectories  which  have  the  same  minimum  cost  as  a*  then 
similar  arguments  still  hold  and  in  the  worst  case  when  they  are  distributed 
across  A  Case  1  holds. 

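Monotonicity. Consider two sequences $\mathcal{A}_1$ and $\mathcal{A}_2$. Define a* as the
trajectory which has the least cost when evaluated on a given environment.

We want to prove that $\min_{a_i \in \mathcal{A}}(\mathrm{cost}(a_i))$ is monotone decreasing, i.e.

$$f(\mathcal{A}_1) \geq f(\mathcal{A}_1 \oplus \mathcal{A}_2), \qquad f(\mathcal{A}_2) \geq f(\mathcal{A}_1 \oplus \mathcal{A}_2) \tag{26}$$

Hence there can be three cases:

• Case 1: $a^* \in \mathcal{A}_1 \Rightarrow f(\mathcal{A}_1) = f(\mathcal{A}_1 \oplus \mathcal{A}_2)$ and $f(\mathcal{A}_2) \geq f(\mathcal{A}_1 \oplus \mathcal{A}_2)$

• Case 2: $a^* \in \mathcal{A}_2 \Rightarrow f(\mathcal{A}_1) \geq f(\mathcal{A}_1 \oplus \mathcal{A}_2)$ and $f(\mathcal{A}_2) = f(\mathcal{A}_1 \oplus \mathcal{A}_2)$

• Case 3: $a^* \in \mathcal{A}_1 \cap \mathcal{A}_2 \Rightarrow f(\mathcal{A}_1) = f(\mathcal{A}_1 \oplus \mathcal{A}_2)$ and $f(\mathcal{A}_2) = f(\mathcal{A}_1 \oplus \mathcal{A}_2)$

Since in all possible cases the conditions in equation 26 are satisfied, $\min_{a_i \in \mathcal{A}}(\mathrm{cost}(a_i))$
is monotone decreasing. □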

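Corollary 1. The function $f$ in equation 23 is monotone, submodular
due to $\min_{a_i \in \mathcal{A}}(\mathrm{cost}(a_i))$ being monotone, supermodular by Theorem 1:
$f$ is an affine function of this minimum with negative slope $-1/N_0$, which
reverses the direction of the inequalities.

□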
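
As a sanity check rather than a proof, the two properties can be verified by brute force on a small hypothetical example. The costs and the bound `N0` below are invented for illustration, the minimum over an empty sequence is assumed to equal `N0` (so that $f$ is 0 there), and sets are used in place of sequences because the minimum cost does not depend on order.

```python
from itertools import combinations

N0 = 10.0  # assumed upper bound on any trajectory cost
cost = {"a": 7.0, "b": 3.0, "c": 5.0, "d": 9.0}  # hypothetical costs in one environment


def f(seq) -> float:
    """f = (N0 - min cost) / N0, with the min over an empty sequence taken as N0."""
    best = min((cost[a] for a in seq), default=N0)
    return (N0 - best) / N0


items = sorted(cost)
subsets = [set(c) for r in range(len(items) + 1) for c in combinations(items, r)]

# Monotone: adding elements never decreases f.
monotone = all(f(A) <= f(A | B) + 1e-12 for A in subsets for B in subsets)

# Submodular: the marginal gain of s shrinks as the base set grows (A subset of B).
submodular = all(
    f(A | {s}) - f(A) >= f(B | {s}) - f(B) - 1e-12
    for B in subsets
    for A in subsets
    if A <= B
    for s in items
    if s not in B
)
```

Both flags evaluate to true for this example, in agreement with the corollary.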


B  Background 

Following a similar analysis to Ross et al. [25], we define the loss function
as the difference between the value of $f_{\mathrm{ENV}}$ that is achieved by
executing the sequence $S$ and that guaranteed by executing the greedy sequence,
$(1 - 1/e) f_{\mathrm{ENV}}(S^*)$:

$$l(\mathrm{ENV}, S) = \left[(1 - 1/e) f_{\mathrm{ENV}}(S^*) - f_{\mathrm{ENV}}(S)\right] \tag{27}$$

Here $S^*$ is the best sequence, which maximizes $f_{\mathrm{ENV}}$. The $(1 - 1/e)$ term is
due to the fact that we are competing with respect to the greedy sequence
and not the best sequence. This is because finding the best sequence has
been proven to be NP-hard, but the greedy sequence has the property that it
achieves performance that is at least $(1 - 1/e)$ ($\approx 63\%$) of the best sequence,
by Feige et al. [12].

A no-regret algorithm produces a sequence of policies $\pi_1, \pi_2, \ldots, \pi_T$
such that the regret with respect to the best policy in hindsight goes to 0 as $T$
goes to $\infty$. In the case of SEQOPT the policy at a time step corresponds to
the sequence of trajectories which are evaluated and the minimum cost
trajectory executed till the next time step. SEQOPT is a no $\alpha$-regret algorithm
which converges to the greedy list and hence $\alpha = (1 - 1/e)$ for SEQOPT.
This can be formalized as:

$$\frac{1}{T}\sum_{i=1}^{T} l_i(S_i) - \min_{S \in \mathcal{S}} \frac{1}{T}\sum_{i=1}^{T} l_i(S) \leq \gamma_T \tag{28}$$

for $\lim_{T \to \infty} \gamma_T = 0$. We choose the loss functions to be the expected loss
under the distribution of environments, $l_i(S) = E_{d(\mathrm{ENV})}[l(\mathrm{ENV}, S)]$. We also
define $\hat{l}_T = \min_{S \in \mathcal{S}} \frac{1}{T}\sum_{i=1}^{T} E_{d(\mathrm{ENV})}[l(\mathrm{ENV}, S)]$ as the loss of the best policy
in hindsight after $T$ iterations.

C  Proof  of  Proposition  1 

Proposition 1. (Approximate Static Optimality) If getNextEnvironment
returns independent examples from a distribution over environments (i.e.,
the chosen control does not affect the next sample), then for a list $S$ chosen
randomly from those generated throughout the $T$ iterations of Algorithm
1, it holds that $E_{d(\mathrm{ENV})}[(1 - 1/e) f_{\mathrm{ENV}}(S^*) - f_{\mathrm{ENV}}(S)] \leq O\left(\sqrt{\frac{\ln(1/\delta)}{T}}\right)$ with
probability greater than $1 - \delta$.

Proof.  If  we  can  evaluate  the  expectation  over  the  distribution  of  environ¬ 
ments  exactly  (infinite  number  of  samples)  then  we  have  the  following  the¬ 
orem: 

Theorem 2. For SEQOPT, for the case when environments are independently
sampled from a distribution of environments, there exists a sequence
$S \in S_{1:T}$ s.t. $E_{d(\mathrm{ENV})}[l(\mathrm{ENV}, S)] \leq \hat{l}_T + \gamma_T$ when the expectation over the
distribution of environments can be exactly evaluated.

Proof.

$$\begin{aligned}
\min_{S \in S_{1:T}} E_{d(\mathrm{ENV})}[l(\mathrm{ENV}, S)]
&\leq \frac{1}{T}\sum_{i=1}^{T} E_{d(\mathrm{ENV})}[l(\mathrm{ENV}, S_i)] \\
&\leq \gamma_T + \min_{S \in \mathcal{S}} \frac{1}{T}\sum_{i=1}^{T} l_i(S) \quad \text{(no regret)} \\
&\leq \gamma_T + \hat{l}_T
\end{aligned} \tag{29}$$

□

The  previous  results  hold  if  the  online  learning  algorithm  observes  the 
infinite  sample  loss,  i.e.  the  loss  on  the  true  distribution  of  environments. 
In  practice  however  the  algorithm  would  only  observe  its  loss  on  a  small 
sample  of  environments.  We  wish  to  bound  the  true  loss  under  the  distri¬ 
bution  of  environments  as  a  function  of  the  regret  on  the  finite  sample  of 
environments. 

Assume that SEQOPT samples $m$ environments in every one of
the $T$ iterations and then observes the loss $l_i(S) = E_{D_i(\mathrm{ENV})}[l(\mathrm{ENV}, S)]$
for $D_i$, the dataset of these $m$ environments. We restate the regret definition
using this finite dataset as $\frac{1}{T}\sum_{i=1}^{T} E_{D_i(\mathrm{ENV})}[l(\mathrm{ENV}, S_i)] - \min_{S \in \mathcal{S}} \frac{1}{T}\sum_{i=1}^{T} E_{D_i(\mathrm{ENV})}[l(\mathrm{ENV}, S)] \leq \gamma_T$.
Let $\hat{l}_T = \min_{S \in \mathcal{S}} \frac{1}{T}\sum_{i=1}^{T} E_{D_i(\mathrm{ENV})}[l(\mathrm{ENV}, S)]$
be the training loss of the best policy in hindsight. Then we have for the
finite sample case the following theorem:

Theorem 3. For SEQOPT, with probability at least $1 - \delta$, there exists a
policy $S \in S_{1:T}$ s.t. $E_{d(\mathrm{ENV})}[l(\mathrm{ENV}, S)] \leq \hat{l}_T + \gamma_T + l_{\max}\sqrt{\frac{2\ln(1/\delta)}{mT}}$ for the case
when environments are independently sampled from a distribution of
environments, when the distribution of environments is sampled $m$ times.

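Proof. Let $Y_{ij}$ be the difference between the expected per step loss of $S_i$
under the environment distribution $d(\mathrm{ENV})$ and the average per step loss
of $S_i$ under the $j$th sampled environment at iteration $i$. The random variables
$Y_{ij}$ over all $i \in \{1, \ldots, T\}$ and $j \in \{1, 2, \ldots, m\}$ are all zero mean and
bounded in $[-l_{\max}, l_{\max}]$ (here $l_{\max} = 1$ as each $l$ is bounded between 0 and 1)
and form a martingale in the order $Y_{11}, Y_{12}, \ldots, Y_{1m}, Y_{21}, \ldots, Y_{Tm}$. By
Azuma-Hoeffding's inequality, $\frac{1}{mT}\sum_{i=1}^{T}\sum_{j=1}^{m} Y_{ij} \leq l_{\max}\sqrt{\frac{2\ln(1/\delta)}{mT}}$ with
probability at least $1 - \delta$. Hence we obtain with probability at least $1 - \delta$:

$$\begin{aligned}
\min_{S \in S_{1:T}} E_{d(\mathrm{ENV})}[l(\mathrm{ENV}, S)]
&\leq \frac{1}{T}\sum_{i=1}^{T} E_{d(\mathrm{ENV})}[l(\mathrm{ENV}, S_i)] \\
&= \frac{1}{T}\sum_{i=1}^{T} E_{D_i(\mathrm{ENV})}[l(\mathrm{ENV}, S_i)] + \frac{1}{mT}\sum_{i=1}^{T}\sum_{j=1}^{m} Y_{ij} \\
&\leq \frac{1}{T}\sum_{i=1}^{T} E_{D_i(\mathrm{ENV})}[l(\mathrm{ENV}, S_i)] + l_{\max}\sqrt{\frac{2\ln(1/\delta)}{mT}} \\
&\leq \gamma_T + \hat{l}_T + l_{\max}\sqrt{\frac{2\ln(1/\delta)}{mT}}
\end{aligned} \tag{30}$$

□

□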
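
To give a feel for the size of the finite-sample deviation term $l_{\max}\sqrt{2\ln(1/\delta)/(mT)}$, the following sketch evaluates it for a few settings. The function name `azuma_term` and the chosen values of $m$, $T$ and $\delta$ are illustrative assumptions only.

```python
import math


def azuma_term(m: int, T: int, delta: float, l_max: float = 1.0) -> float:
    """Deviation term l_max * sqrt(2 ln(1/delta) / (m T))."""
    return l_max * math.sqrt(2.0 * math.log(1.0 / delta) / (m * T))


# One sampled environment per iteration, 95% confidence (delta = 0.05).
term_100 = azuma_term(m=1, T=100, delta=0.05)      # about 0.245
term_10000 = azuma_term(m=1, T=10000, delta=0.05)  # ten times smaller
```

Because the term decays as $1/\sqrt{mT}$, a hundredfold increase in the number of sampled environments shrinks it by a factor of ten.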


D  Proof  of  Proposition  2 

Proposition 2. (Approximate Weak Dynamic Optimality) If getNextEnvironment
returns examples by forward simulating beginning with a random
environment and randomly choosing a new environment on reaching a goal,
then consider the policy $\pi_{\mathrm{mixture}}$ that begins each new trial by choosing a list
randomly from those generated throughout the $T$ iterations of Algorithm
1. By the no-regret property, such a mixture policy will be $\alpha$-approximately
dynamically optimal in expectation up to an additive term $O\left(\sqrt{\frac{\ln(1/\delta)}{T}}\right)$ with
probability greater than $1 - \delta$. Further, in the (experimentally typical) case
where the distribution over library sequences converges, the resulting single
list is (up to approximation factor $\alpha$) weakly dynamically optimal.

Proof. Consider $\pi_{\mathrm{mixture}}$, the policy that begins each new trial of Algorithm
1 by choosing a list randomly from those generated throughout the $T$
iterations. Let $d_{\pi_{\mathrm{mixture}}}(\mathrm{ENV})$ be the distribution of environments encountered as
a consequence of following policy $\pi_{\mathrm{mixture}}$. Let $S_i$ be the randomly sampled
list at the $i$th iteration. Let $Y_{ij}$ be the difference between the expected per
step loss of $S_i$ under the distribution $d_{\pi_i}(\mathrm{ENV})$ and the average per step
loss of $S_i$ under the $j$th sample trajectory of horizon $H$. Note that here each
sample $j$ is a sequence of environments encountered as a consequence of the
robot following the mixture policy $\pi_{\mathrm{mixture}}$, whereas in Proposition 1 each $j$th
sample is an environment sampled from a distribution of environments.
Following the mixture policy $\pi_{\mathrm{mixture}}$ corresponds to randomly sampling one of
the lists generated by Algorithm 1 during $T$ iterations. The random variables
$Y_{ij}$ over all $i \in \{1, \ldots, T\}$ and $j \in \{1, 2, \ldots, m\}$ are all zero mean and bounded
in $[-l_{\max}, l_{\max}]$ (here $l_{\max} = 1$ as each $l$ is bounded between 0 and 1) and
form a martingale (considering the order $Y_{11}, Y_{12}, \ldots, Y_{1m}, Y_{21}, \ldots, Y_{Tm}$). By
Azuma-Hoeffding's inequality, $\frac{1}{mT}\sum_{i=1}^{T}\sum_{j=1}^{m} Y_{ij} \leq l_{\max}\sqrt{\frac{2\ln(1/\delta)}{mT}} = \sqrt{\frac{2\ln(1/\delta)}{mT}}$
with probability at least $1 - \delta$. Hence we obtain with probability at least
$1 - \delta$:


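$$\begin{aligned}
E_{\pi_i \sim \pi_{\mathrm{mixture}}}\left[E_{d_{\pi_i}(\mathrm{ENV})}[l(\mathrm{ENV}, \pi_i)]\right]
&\leq \frac{1}{T}\sum_{i=1}^{T} E_{d_{\pi_i}(\mathrm{ENV})}[l(\mathrm{ENV}, S_i)] \\
&\leq \frac{1}{T}\sum_{i=1}^{T} E_{D_i(\mathrm{ENV})}[l(\mathrm{ENV}, S_i)] + \frac{1}{mT}\sum_{i=1}^{T}\sum_{j=1}^{m} Y_{ij}
\end{aligned} \tag{31}$$

□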